Abstract


Preclinical assessment is a useful strategy for promoting skill improvement in the clinical phase. It
enables early intervention in the learning process and promotes effective use of training resources.
This study aimed to assess inter-examiner variability in the evaluation of Class II cavity preparations
performed by undergraduate dental students, using different scoring methods. Twenty undergraduate
students each performed two Class II amalgam preparations on plastic molar teeth. The preparations
were evaluated by four blinded, independent examiners using two methods: the modified Neelakantan
method and an objective checklist scoring method. Inter- and intra-examiner variability were tested
using the Friedman test and the Wilcoxon signed-rank test, respectively. The Kruskal-Wallis H test was
employed to analyze scoring system variability and examiner consistency. Scoring method I tended to
have higher ranks than scoring method II. The findings suggest that both scoring methods are reliable
and consistent tools for evaluating Class II cavity preparations, with good inter-examiner agreement
and intra-examiner reliability. Conclusion: Both scoring methods are reliable for evaluating Class II
cavity preparations, with minimal inter- and intra-examiner variability. This suggests that they can be
used with confidence in pre-clinical practice, as they provide a consistent and accurate way to assess
the quality of Class II cavity preparations.
Introduction


The transition from pre-clinical to clinical training in dentistry is a special time in
the student's life because of the wide range of challenges that the student must overcome.
It is a difficult stage and possibly one of the most important in the development of a career
identity on both a technical and a personal level [1]. Preclinical laboratory instruction in the
field of operative dentistry integrates exercises and tasks [2].

Second-year students at the school of dentistry at the University of Tripoli, Libya, are examined
on Class II cavity preparation for amalgam filling, although amalgam is rarely used on Libyan
patients. Amalgam is, however, still widely used in many developing countries [3].

This training requires assessment, which is designed to determine a student's level of 
knowledge, behavior, or skill development. It can be used not just to formally recognize the 
acquisition of knowledge or abilities but also to support learning and give students feedback 
on how they performed [4]. Since most students prefer to focus more on assessments and 
their results than any other aspect of the educational process, assessments are an essential 
component of the learning process [5]. The optimal assessment concept should have outstanding
characteristics such as reliability, validity, accountability, flexibility, comprehensiveness,
feasibility, timeliness, and reliance [6,7].

Problems with examiner consistency may lead students to perceive evaluation methods as
somewhat uninformed [5]. This perception can shape the learning process and have
a negative effect on undergraduate confidence and performance [7]. An assessment method that
is both objective and reliable is therefore essential [8]: to support an effective system
of learning and to reduce friction between students and faculty over grading,
objective, reliable, and practical methods need to be applied [9]. Examiner consistency is
critical in the teaching and training process because it can affect the confidence and
performance of the students. New evaluation techniques and methods of standardizing
assessments therefore need further study to encourage an efficient system of learning [10].

It is essential to highlight that inter-examiner reliability or agreement estimates the degree 
of consistency or agreement among examiners when assessing the results of the same group 
of students on the same task [11], whereas intra-examiner reliability or agreement describes 
the consistency of a single examiner in grading the same sample on two different occasions 
[12]. 

This study aimed to assess inter- and intra-examiner variability in second-year dental
students' Class II tooth preparation scores, using two methods of evaluation, at the
School of Dentistry, University of Tripoli.


Methods 

Study Design 

This study involved 20 second-year undergraduate dental students. The students attended a
2-hour didactic lecture on Class II cavity preparation for amalgam, followed by a 1-hour
demonstration session, and then practiced for one hour per week for three weeks. After the
lecture and demonstration, the students performed Class II cavity preparations on teeth
under controlled conditions.


Data Collection 

The tooth models were collected after a 30-minute preparation period, adhering to the
university examination time schedule. Each preparation was assigned a number code to ensure
blinding. Four independent examiners evaluated the preparations in a double-blind manner
using two different scoring systems. The evaluations were conducted without magnification,
using an explorer and a mouth mirror under illumination. The assessments were performed
in two stages, with a two-week interval between evaluations.


Study examiners

Four independent faculty members from the School of Dentistry, University of Tripoli, each
with over ten years of clinical and teaching experience, served as examiners. These
examiners were not involved in designing the checklist or scoring criteria and did not undergo
specific calibration procedures. They were only briefed on the scoring distribution and
criteria.


Methods of assessment 

In the first stage, the examiners evaluated the preparations using the modified Neelakantan
method (subjective study = scoring method I) [13]. After a two-week interval, the same
examiners re-evaluated the same preparations using the objective checklist criteria scoring
method (objective study = scoring method II) [14]. Each scoring method was then applied a
second time, again after a two-week interval, so that every preparation received two
evaluations under each method. E. Khalaf et al. [13] employed the Neelakantan method for
student self-assessment; we adapted it to fit evaluation by professors and examiners. It was
chosen because it most closely resembles the college's evaluation procedure. It has only four
subjective evaluation points, making it easier to apply than scoring method II, which is
deemed analytical.


Data analysis 

The data were analyzed using non-parametric tests because they were not normally distributed.
The Wilcoxon signed-rank test was used to compare evaluation scores between the two scoring
systems and to assess intra-examiner variability in evaluating Class II cavity preparations
under each scoring system. Inter-examiner variability under each scoring system was analyzed
using the Friedman test. The Kruskal-Wallis H test was employed to analyze scoring system
variability and examiner consistency.
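
As an illustration of how a paired comparison of this kind can be computed, the following is a minimal Python sketch of the Wilcoxon signed-rank test using the normal approximation with a tie correction. It is not the software or code used in the study; the function names and the six pairs of scores are hypothetical and serve only to show the steps (drop zero differences, rank the absolute differences, sum the ranks by sign, standardize the smaller sum).

```python
import math
from collections import Counter


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions are 0-based, ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def wilcoxon_signed_rank_z(first, second):
    """Return (W_plus, W_minus, z) for paired scores (normal approximation)."""
    # Pairs with a zero difference carry no information and are dropped.
    diffs = [a - b for a, b in zip(first, second) if a != b]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")
    abs_diffs = [abs(d) for d in diffs]
    ranks = average_ranks(abs_diffs)
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    mean = n * (n + 1) / 4
    # Tie correction: each group of t equal absolute differences reduces the variance.
    tie_term = sum(t**3 - t for t in Counter(abs_diffs).values()) / 48
    variance = n * (n + 1) * (2 * n + 1) / 24 - tie_term
    z = (min(w_plus, w_minus) - mean) / math.sqrt(variance)
    return w_plus, w_minus, z


# Hypothetical paired scores for six preparations (not data from the study).
first_scores = [8, 7, 6, 9, 5, 7]
second_scores = [6, 7, 5, 5, 6, 4]
print(wilcoxon_signed_rank_z(first_scores, second_scores))
```

Because the smaller rank sum is standardized, the resulting Z value is never positive, which matches the sign convention of the Z values reported in the Results.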


Results 

Comparison of evaluation scores between the two scoring systems

Table 1 presents a comparative analysis revealing a statistically significant difference
between the two scoring systems using the Wilcoxon signed-rank test. Both negative Z-values
were statistically significant: Z = -4.545 for the first evaluation and Z = -5.833 for the
second evaluation. These values indicate that scoring system II tends to have lower ranks
than scoring system I in both comparisons. Furthermore, the asymptotic significance values
(0.001) confirm that these results are highly significant.


Table 1. Wilcoxon signed-rank test: comparative analysis of scoring system I and scoring system II

Comparison                                                  Mean rank   Z-value   Asymp. sig.
Scoring system II - Scoring system I (first evaluation)     40.88       -4.545    .001
Scoring system II - Scoring system I (second evaluation)    42.02       -5.833    .001

Z = Z value, Asymp. sig. = asymptotic significance


Inter-examiner variability in evaluating Class II cavity preparations using each scoring system

Assessment of inter-examiner variability using scoring system I

The Friedman test was employed to assess inter-examiner variability in evaluating
Class II cavity preparations using scoring system I (Table 2). The results revealed no
statistically significant differences between the first and second evaluations across all four
examiners, with p-values exceeding 0.05. Specifically, the p-values for each examiner were
as follows: Examiner 1 (p = 0.251), Examiner 2 (p = 0.796), Examiner 3 (p = 0.637), and
Examiner 4 (p = 0.796). These findings indicate that there is no significant inter-examiner
variability in evaluating Class II cavity preparations using scoring system I.


Table 2. Friedman test results for assessing inter-examiner variability in evaluating Class II
cavity preparations (scoring system I)

Examiner     Mean rank            Mean rank             Chi-square   Asymp. sig.
             (first evaluation)   (second evaluation)
Examiner 1   1.38                 1.63                  1.316        0.251
Examiner 2   1.53                 1.48                  0.067        0.796
Examiner 3   1.55                 1.45                  0.222        0.637
Examiner 4   1.53                 1.48                  0.067        0.796

Asymp. sig. = asymptotic significance


Assessment of inter-examiner variability using scoring system II

Table 3 presents the results of the Friedman test assessing inter-examiner variability using
scoring system II. No statistically significant differences were found between the first and
second evaluations across all four examiners, with p-values exceeding the conventional
significance threshold (α = 0.05). Specifically, the p-values for each examiner were as
follows: Examiner 1 (p = 0.346), Examiner 2 (p = 0.166), Examiner 3 (p = 0.251), and
Examiner 4 (p = 0.109). These findings indicate that there is no statistically significant
inter-examiner variability in evaluating Class II cavity preparations using scoring system II.


Table 3. Friedman test results for assessing inter-examiner variability in evaluating Class II
cavity preparations (scoring system II)

Examiner     Mean rank            Mean rank             Chi-square   Asymp. sig.
             (first evaluation)   (second evaluation)
Examiner 1   1.60                 1.40                  0.889        0.346
Examiner 2   1.38                 1.63                  1.923        0.166
Examiner 3   1.63                 1.38                  1.316        0.251
Examiner 4   1.65                 1.35                  2.571        0.109

Asymp. sig. = asymptotic significance
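
The mean ranks and chi-square values of a Friedman test can be obtained as in the following minimal Python sketch, which uses the tie-corrected form of the statistic. It is an illustration only and not the code used in the study; the function names and the eight pairs of scores are hypothetical. Each row is one preparation and each column one evaluation occasion.

```python
def rank_within_block(scores):
    """Rank one block's scores (1 = lowest), averaging the ranks of ties."""
    ranks = []
    for s in scores:
        lower = sum(1 for other in scores if other < s)
        equal = sum(1 for other in scores if other == s)  # includes s itself
        ranks.append(lower + (equal + 1) / 2)
    return ranks


def friedman_chi_square(blocks):
    """Return (mean ranks per column, tie-corrected Friedman chi-square)."""
    n = len(blocks)       # number of blocks (preparations)
    k = len(blocks[0])    # number of conditions (evaluation occasions)
    ranked = [rank_within_block(row) for row in blocks]
    column_sums = [sum(row[j] for row in ranked) for j in range(k)]
    expected = n * (k + 1) / 2
    numerator = (k - 1) * sum((r - expected) ** 2 for r in column_sums)
    denominator = sum(r * r for row in ranked for r in row) - n * k * (k + 1) ** 2 / 4
    if denominator == 0:
        raise ValueError("every block is fully tied")
    mean_ranks = [s / n for s in column_sums]
    return mean_ranks, numerator / denominator


# Hypothetical scores for eight preparations on two occasions (not study data):
# five scored higher the second time, two scored lower, one was unchanged.
scores = [(3, 4)] * 5 + [(4, 3)] * 2 + [(3, 3)]
print(friedman_chi_square(scores))
```

With only two occasions per examiner, the tie-corrected statistic reduces to (n_up - n_down)^2 / (n_up + n_down), where n_up and n_down count the preparations scored higher and lower on the second occasion, and it is referred to a chi-square distribution with one degree of freedom. Unchanged scores therefore contribute nothing to the statistic.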
Intra-examiner variability in evaluating Class II cavity preparations using each scoring system

Assessment of intra-examiner variability using scoring system I 

The Wilcoxon signed-rank test was conducted to assess intra-examiner variability in
evaluating Class II cavity preparations using scoring system I (Table 4). The test revealed
no significant differences between the two evaluations for any examiner, as indicated by
asymptotic significance (p-value) greater than 0.05.

These results suggest that all examiners demonstrated good intra-examiner reliability when
evaluating Class II cavity preparations using scoring system I. In other words, each examiner
was consistent in their scoring and did not show any significant variation over time.


Table 4. Wilcoxon signed-rank test results for assessing intra-examiner variability in
evaluating Class II cavity preparations using scoring system I

Examiner     Z        Asymp. sig.
Examiner 1   -1.608   0.108
Examiner 2   -0.233   0.816
Examiner 3   -0.418   0.676
Examiner 4   -0.631   0.528

Asymp. sig. = asymptotic significance


Assessment of intra-examiner variability using scoring system II

The Wilcoxon signed-rank test revealed no significant differences between
the two evaluations for any examiner, as indicated by asymptotic significance (p-value)
greater than 0.05. Specifically, Examiner 1 had a Z value of -0.066
and an asymptotic significance of 0.194, Examiner 2 had a Z value of -1.686 and an
asymptotic significance of 0.083, Examiner 3 had a Z value of -1.733 and an asymptotic
significance of 0.676, and Examiner 4 had a Z value of -0.381 and an asymptotic significance
of 0.703 (Table 5).


Table 5. Wilcoxon signed-rank test results for assessing intra-examiner variability in
evaluating Class II cavity preparations using scoring system II

Examiner     Z        Asymp. sig.
Examiner 1   -0.066   0.194
Examiner 2   -1.686   0.083
Examiner 3   -1.733   0.676
Examiner 4   -0.381   0.703

Asymp. sig. = asymptotic significance


Scoring System Variability: An Analysis of Examiner Consistency 

An analysis of the four examiners' scoring patterns across the two scoring systems and two
evaluation occasions revealed the following. There were no significant differences in scores
between examiners for the first evaluation with scoring system I (Kruskal-Wallis H = 0.283,
asymptotic significance = 0.963). For its second evaluation, the difference between examiners
was larger but still did not reach statistical significance (H = 6.056, asymptotic
significance = 0.109), suggesting some variation in scoring approaches. Scoring system II
likewise showed no significant differences between examiners for either evaluation, with
Kruskal-Wallis H values of 1.441 and 2.219 and asymptotic significances of 0.696 and 0.528,
respectively (Table 6).

Table 6. Mean ranks by examiner and scoring system

Examiner           Scoring system I   Scoring system I   Scoring system II   Scoring system II
                   (first)            (second)           (first)             (second)
Examiner 1         41.85              49.33              43.20               45.08
Examiner 2         41.50              43.68              35.20               40.03
Examiner 3         38.90              34.50              41.80               34.58
Examiner 4         39.40              34.50              41.80               42.33
Kruskal-Wallis H   0.283              6.056              1.441               2.219
Asymp. sig.        0.963              0.109              0.696               0.528
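
The asymptotic significance of a Kruskal-Wallis H value comes from a chi-square distribution with (number of groups - 1) degrees of freedom. The following minimal Python sketch assumes four examiners, and therefore three degrees of freedom, and converts the H values reported in Table 6 into upper-tail probabilities using the closed form available for three degrees of freedom. The function name is hypothetical and the code is an illustration, not the software used in the study.

```python
import math


def chi_square_sf_3df(h):
    """Upper-tail probability of a chi-square variable with 3 degrees of freedom."""
    if h < 0:
        raise ValueError("statistic must be non-negative")
    # Closed form for 3 df: erfc(sqrt(h/2)) + sqrt(2h/pi) * exp(-h/2)
    return math.erfc(math.sqrt(h / 2)) + math.sqrt(2 * h / math.pi) * math.exp(-h / 2)


# Kruskal-Wallis H values as reported in Table 6.
# Assumption: four examiner groups, so df = 4 - 1 = 3.
reported_h = {
    "Scoring system I (first)": 0.283,
    "Scoring system I (second)": 6.056,
    "Scoring system II (first)": 1.441,
    "Scoring system II (second)": 2.219,
}

for label, h in reported_h.items():
    print(f"{label}: H = {h:.3f}, p = {chi_square_sf_3df(h):.3f}")
```

The printed probabilities can be compared with the asymptotic significance row of Table 6. Under the conventional 0.05 threshold, none of the four comparisons indicates a significant difference between examiners.
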
Discussion 


Pre-clinical exercises are an important way for students to improve the manual skills
required to reach high competency levels in restorative dentistry [15]. To support an effective
learning system and eliminate grading friction between learners and teachers, an objective
and reliable assessment approach is required. In the current study, the students' scores were
higher with scoring method I than with scoring method II, both within each examiner's two
evaluations and across the results of all examiners. The higher scores under scoring method I
are most likely due to its simplicity: it consists of only four evaluation points for each
tooth, does not cover the exact details of the work, and may therefore overlook some errors.
Scoring method II, by contrast, is analytical, consists of fourteen points, is more accurate,
and gives the examiner time to investigate the preparation details [16]. The scoring method II
used in this study was based on earlier studies by Haj-Ali et al. and Park et al., as these
scoring distributions more closely met the criteria of allowing feedback and reflecting the
stage or stages of preparation at which students were performing poorly [17, 18].

In this study, we found minimal variability between the examiners with both
methods. A comparable result was achieved in studies by Bazan and Seale [19] and
Schmitt et al. [20], where a similarly conceived examiner's checklist for evaluation led to a
similar reliability value. The agreement between the raters in our results may be due to the
educational and evaluation experience of the four raters, despite their lack of familiarity with
the two evaluation methods at the time of the research. However, experience has been shown
to enhance inter-examiner agreement, with statistically significant variations amongst
examiners [21].

The results obtained in this study differed from those of Sherwood and Douglas [14],
Mhanni [16], Sharaf et al. [8], Vann et al. [22], Lilley et al. [23], and
Satterthwaite and Grey [24]. This difference might be related to the preparation design
examined in the current study (only one surface of Class II), in contrast to the eight and
four different preparation designs in the studies by [16] and [14], respectively. A single
design may have allowed the examiners to adapt more readily when assessing students'
preparations, improving their reliability in scoring. According to the results of
[20], the reliability value can be increased by a higher number of examiners. This is
relatively consistent with our study, where the teeth were evaluated by four examiners instead
of three or two, as in other studies [8, 14, 23, 24]. This interpretation is contrary to the
study of Vann et al. [22], which did not find agreement despite using six examiners.

Moreover, the study revealed that, among the four examiners, the intra-examiner variability
was non-significant. These results were similar to a study by Sharaf et al. [8], which showed
a reduction in intra-examiner variation when an objective checklist criteria
scoring system was used. Our results were also similar to those of the third examiner in
study [14], which used five different evaluation methods, ranging from descriptive to nearly
subjective to more analytical. That study found significant intra-examiner variability
except for the third examiner, whose evaluation was confirmed as valid [16].
On the other hand, a larger, though not statistically significant, difference in scores was
found between examiners for scoring system I (second occasion). This is somewhat similar to
studies [8], [14], and [16], in which subjective or nearly subjective scoring methods were
used. The similarity in results may be due to the fact that these evaluations are not as
analytical as an objective checklist. A checklist defines the evaluation precisely and thereby
limits variability among examiners, whereas descriptive or nearly subjective methods leave room
for the examiner's personal opinion and sometimes mood. Goepferd and Kerber used an analytical
system for evaluation with specific criteria; they reported that objective checklist criteria
reduced variability among examiners more effectively than the glance and grade method [25].

However, in the current study scoring method I showed lower intra-rater than inter-rater
reliability, particularly on the second occasion, similar to the results of Lilley
et al. [23] and Fuller [26], although the relationship in this study was not significant
compared to their studies. The present results differ from those of Houpt and Kress [27]
and Deranleau et al. [26], who suggested that fewer points of evaluation are more likely to
result in higher agreement. In the current study, by contrast, the method with fewer
evaluation points had relatively less agreement between the same rater's two evaluations. This
may be attributable to the experience of the four examiners in accurately evaluating students:
even though they were working with the analytical method for the first time in this study,
they applied it in an excellent manner.

It is important to mention that scoring method I takes less time than scoring method II:
the average time taken to evaluate all teeth did not exceed 25 minutes with scoring method I,
compared with 50 minutes with scoring method II. According to Caro et al. (1979) [29],
examiners who spend more than 30 minutes assessing tooth preparations may become
tired, which can affect the result of the assessment. Mhanni's study likewise indicated that
examiners felt somewhat exhausted following the twelfth examination [16].

Finally, the clear inter- and intra-examiner agreement, especially with the objective
checklist criteria, indicates the reliability of the two methods and the extent of the
examiners' compatibility and experience in dealing with the different methods.


Conclusions 

The results suggest that there is no statistically significant inter-examiner or intra-examiner
variability in evaluating Class II cavity preparations using either scoring system I or scoring
system II. However, the difference between examiners was largest, though still not statistically
significant, for scoring system I on the second occasion, suggesting possible differences in
scoring patterns.


Limitations 

Scoring system I may require further refinement to reduce potential variation in scoring approaches
among examiners. In addition, the small sample size used in this study may limit the reliability of
the results. The Wilcoxon signed-rank test and the Friedman test also have limitations in terms of
their assumptions and the interpretation of their results.


Recommendations 

Based on these conclusions, the following recommendations can be made. Both scoring systems I and II
are reliable and consistent methods for evaluating Class II cavity preparations, and the study
suggests that they can be used interchangeably, as they demonstrate minimal inter- and
intra-examiner variability. Future studies could have students prepare the teeth, and examiners
evaluate them, under magnification and compare the results with those of the current study.
Computerized dental assessment has been proposed to overcome the limitations of visual assessment of
typodont teeth in preclinical dental teaching. Incorporating students' self-assessment and
evaluating its reliability and credibility could improve self-directed learning.